Abstract

We consider supervised learning for data of functional type. The crucial
point is the choice of an appropriate fitting method (also called a
learner). Boosting is a stepwise technique that combines learners in such
a way that the composite (boosted) learner outperforms the single learner.
This can be done either by reweighting the examples or with the help of a
gradient descent technique. In this paper, we explain how to extend
Boosting methods to problems that involve functional data.
1 A Short Introduction to Boosting

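The task is the following: We try to estimate a relationship

$$F : \mathcal{X} \to \mathcal{Y} \qquad (1)$$

based on a finite set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\} \subset
\mathcal{X} \times \mathcal{Y}$ of observations. A popular strategy is to
fix a class of functions $\mathcal{F}$ and to minimize the empirical risk

$$\frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) \qquad (2)$$

over all elements $f \in \mathcal{F}$. Here

$$L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R} \qquad (3)$$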

is a loss function. Sometimes, a regularization term $r(f)$ is added to (2).
We call fitting methods like this learners. Popular examples for
multivariate data are trees, support vector machines or smoothing splines.
The choice of the learner is crucial, as too complex learners lead to
overfitting, whilst 'weak' learners fail to capture the relevant structure.
The term weak learner has its origins in the machine learning literature.
In classification problems, a weak learner is a learner that is slightly
better than random guessing. (The exact definition can be found in e.g.
[MR03].) For regression problems, we might think of a learner that has a
high bias compared to its variance, or a learner that has only a few
degrees of freedom.

*TU Berlin - Department of Computer Science and Electrical Engineering,
Franklinstr. 28/29, 10587 Berlin, Germany, nkraemer@cs.tu-berlin.de

The basic idea of Boosting is to proceed stepwise and to combine weak
learners in such a way that the composite (boosted) learner

$$f_M(x) = \sum_{m=1}^{M} \alpha_m \, g_m(x) \qquad (4)$$

(or $\mathrm{sign}(f_M)$ for classification problems) performs better than
the single weak learners $g_m$. The single learners are usually called base
learners and $M$ is called the number of Boosting iterations. The learners
$g_m$ and the weights $\alpha_m$ are chosen adaptively from the data.
AdaBoost [FS97], the first Boosting algorithm, is designed for
classification problems. It is presented in algorithm 1. The weak base
learner is repeatedly applied to the weighted training sample $S$. Points
which were hard to approximate in step $m$ are given higher weights in the
next iteration step. For some learners, it is not possible to compute a
weighted loss. Instead, in each step we draw with replacement a sample of
size $n$ from $S$ and use the weights $D_m$ as probabilities.

Algorithm 1 AdaBoost

Input: sample $S$, weak learner, $M$, initial weights $D_1(x_i) = 1/n$
for $m = 1, \ldots, M$ do
    Fit a function $g_m$ to the weighted sample $(S, D_m)$ using the weak learner
    Compute the weighted error
        $\epsilon_m = \sum_{i=1}^{n} D_m(x_i) \, I\{y_i \neq g_m(x_i)\}$.
    Set $\alpha_m = \ln\left( \frac{1 - \epsilon_m}{\epsilon_m} \right)$.
    Update the weights:
        $D_{m+1}(x_i) = D_m(x_i) \exp\left( -\alpha_m y_i g_m(x_i) \right)$.
end for
return $h(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m g_m(x) \right)$.
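The reweighting loop of algorithm 1 can be illustrated with a small, self-contained sketch. It is not the implementation used in this paper: the one-dimensional decision stump, the function names, and the toy data are assumptions made for illustration. The sketch also uses the common step size $\alpha_m = \frac{1}{2} \ln((1 - \epsilon_m)/\epsilon_m)$ and renormalizes the weights after every step, which may differ by a constant factor from the step size stated in algorithm 1.

```python
import math

def fit_stump(xs, ys, weights):
    """Weighted decision stump on 1-D inputs (hypothetical weak learner).

    Returns (threshold, polarity); the stump predicts `polarity` for
    x >= threshold and `-polarity` otherwise.
    """
    best = None
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (pol if x >= thr else -pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best[1], best[2]

def stump_predict(stump, x):
    thr, pol = stump
    return pol if x >= thr else -pol

def adaboost(xs, ys, n_rounds):
    """Labels ys must be coded as +1 / -1. Returns a list of (alpha, stump)."""
    n = len(xs)
    weights = [1.0 / n] * n                      # D_1(x_i) = 1/n
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(xs, ys, weights)
        preds = [stump_predict(stump, x) for x in xs]
        eps = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)  # assumed step size
        weights = [w * math.exp(-alpha * y * p)
                   for w, p, y in zip(weights, preds, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]   # assumed renormalization
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

On the toy sample `xs = 0, ..., 9` with labels `+1, +1, +1, -1, -1, -1, +1, +1, +1, -1`, no single stump separates the classes, whereas the combination of three reweighted stumps does.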

It can be shown [Bre98, Bre99] that Boosting is a forward stage-wise fitting
method using gradient descent techniques. More precisely, in each step we
fit a weak learner to $x_i$ and the negative gradient

$$u_i = - \left. \frac{\partial L(y_i, f)}{\partial f} \right|_{f = f_m(x_i)} \qquad (5)$$

of the loss function (3). The connection between Boosting and gradient
descent methods has led to a wide range of new algorithms [Fri01], notably
for regression problems. Note that if we use the quadratic loss

$$L(y, y') = \frac{1}{2} (y - y')^2 ,$$

the negative gradient is simply the vector of residuals, i.e. we iteratively
fit the residuals using a weak learner. This method is called L2Boost
[BY03, Fri01] and is presented in algorithm 2. Boosting with the loss
function

$$L(y, y') = \log\left( 1 + \exp(-y y') \right) ,$$

is suited for classification problems and called LogitBoost [FHT00] (see
algorithm 3). The function $f_M$ is an estimate of one-half of the log-odds
ratio

$$\frac{1}{2} \log\left( \frac{P(Y = 1 \mid X = x)}{1 - P(Y = 1 \mid X = x)} \right) .$$

As a consequence, this classification algorithm also produces estimates of
the class probabilities $P(Y = 1 \mid X = x)$. Generic Boosting algorithms
for a general loss function can be found in [BY03, Fri01].

Algorithm 2 L2Boost

Input: sample $S$, weak learner, $M$
Fit a function $g_1(x)$ using the weak learner and set $f_1 = g_1$.
for $m = 1, \ldots, M - 1$ do
    Compute the residuals
        $u_i = y_i - f_m(x_i)$, $i = 1, \ldots, n$.
    Fit a function $g_{m+1}$ to $(x_i, u_i)$ using the weak learner
    Update
        $f_{m+1}(x) = f_m(x) + g_{m+1}(x)$.
end for
return $f_M = \sum_{m=1}^{M} g_m(x)$
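The residual-fitting loop of algorithm 2 can be sketched in a few lines. The weak learner below is a least-squares regression stump on one-dimensional inputs; this choice, the function names, and the convention of starting from the zero function (so that the first stump is fitted to $y$ itself) are assumptions for illustration only.

```python
def fit_regression_stump(xs, us):
    """Least-squares stump: one split, one constant on each side.

    Hypothetical weak learner; needs at least two distinct x values.
    Returns (threshold, left_mean, right_mean).
    """
    best = None
    for thr in sorted(set(xs))[1:]:          # both sides stay non-empty
        left = [u for x, u in zip(xs, us) if x < thr]
        right = [u for x, u in zip(xs, us) if x >= thr]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = (sum((u - lm) ** 2 for u in left)
               + sum((u - rm) ** 2 for u in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    return best[1], best[2], best[3]

def l2boost(xs, ys, n_learners):
    """Fit n_learners stumps, each to the residuals of the current fit."""
    learners = []
    fitted = [0.0] * len(xs)
    for _ in range(n_learners):
        # Negative gradient of the quadratic loss = residuals.
        residuals = [y - f for y, f in zip(ys, fitted)]
        thr, lm, rm = fit_regression_stump(xs, residuals)
        learners.append((thr, lm, rm))
        fitted = [f + (lm if x < thr else rm) for x, f in zip(xs, fitted)]
    return learners

def l2_predict(learners, x):
    return sum(lm if x < thr else rm for thr, lm, rm in learners)
```

Because each stump is a least-squares fit to the current residuals, the training error cannot increase from one step to the next; whether the test error also decreases depends on the number of steps.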

How do we obtain the optimal number of Boosting iterations? One possibility
is to use cross-validation. Depending on the data, this can lead to high
computational costs. If we use L2Boost, it is possible to compute the
degrees of freedom of the Boosting algorithm [BY03, Büh06]. As a
consequence, we can use model selection criteria such as the Akaike
Information Criterion (AIC) or the Bayesian Information Criterion (BIC).



Algorithm 3 LogitBoost

Input: sample $S$, weak learner, $M$
Initialize probabilities $p_1(x_i) = 1/2$ and set $f_0(x) = 0$
for $m = 1, \ldots, M$ do
    Compute weights and negative gradients
        $D_m(x_i) = p_m(x_i) \left( 1 - p_m(x_i) \right)$
        $u_i = \frac{y_i - p_m(x_i)}{D_m(x_i)}$, $i = 1, \ldots, n$.
    Fit a regression function $g_m$ to $(x_i, u_i)$ by weighted least squares
    Update
        $f_m(x) = f_{m-1}(x) + \frac{1}{2} g_m(x)$
        $p_{m+1}(x_i) = \left( 1 + \exp(-2 f_m(x_i)) \right)^{-1}$.
end for
return $f_M$
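The following sketch shows one way the steps of algorithm 3 fit together. Several details are assumptions and not taken from the text: the labels are coded as 0/1 in the working response, the update $f_m = f_{m-1} + \frac{1}{2} g_m$ follows the usual formulation in [FHT00], the weak learner is a weighted least-squares stump on one-dimensional inputs, and the weights are bounded away from zero for numerical stability.

```python
import math

def fit_weighted_stump(xs, zs, ws):
    """Weighted least-squares stump (hypothetical weak learner).

    Returns (threshold, left_mean, right_mean).
    """
    best = None
    for thr in sorted(set(xs))[1:]:          # both sides stay non-empty
        total, means = 0.0, []
        for is_left in (True, False):
            idx = [i for i, x in enumerate(xs) if (x < thr) == is_left]
            wsum = sum(ws[i] for i in idx)
            mean = sum(ws[i] * zs[i] for i in idx) / wsum
            total += sum(ws[i] * (zs[i] - mean) ** 2 for i in idx)
            means.append(mean)
        if best is None or total < best[0]:
            best = (total, thr, means[0], means[1])
    return best[1], best[2], best[3]

def half_log_odds_to_prob(f):
    """p = 1 / (1 + exp(-2 f)), with f clipped to avoid overflow."""
    f = max(-30.0, min(30.0, f))
    return 1.0 / (1.0 + math.exp(-2.0 * f))

def logitboost(xs, labels01, n_rounds):
    """labels01 are class labels coded as 0 or 1 (an assumption)."""
    f = [0.0] * len(xs)
    p = [0.5] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        w = [max(pi * (1.0 - pi), 1e-10) for pi in p]            # D_m(x_i)
        u = [(y - pi) / wi for y, pi, wi in zip(labels01, p, w)]  # u_i
        thr, lm, rm = fit_weighted_stump(xs, u, w)
        stumps.append((thr, lm, rm))
        f = [fi + 0.5 * (lm if x < thr else rm) for fi, x in zip(f, xs)]
        p = [half_log_odds_to_prob(fi) for fi in f]
    return stumps

def prob_one(stumps, x):
    """Estimated P(Y = 1 | X = x) from the boosted half log-odds."""
    f = sum(0.5 * (lm if x < thr else rm) for thr, lm, rm in stumps)
    return half_log_odds_to_prob(f)
```

The function `prob_one` makes the remark on class probabilities concrete: the boosted function is mapped back to a probability through the inverse of the half log-odds transform.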



2 Functional Data Analysis 

The content of this section is condensed from [RS05]. We speak of functional
data if the variables that we observe are curves. Let us first consider the
case that only the predictor samples $x_i$ are curves, that is

$$x_i \in \mathcal{X} = \{ x : T \to \mathbb{R} \} .$$

Examples for this type of data are time series, temperature curves or
near-infrared spectra. We usually assume that the functions fulfill a
regularity condition, and in the rest of the paper, we consider the Hilbert
space $\mathcal{X} = L^2(T)$ of all square-integrable functions
$T \to \mathbb{R}$.

2.1 How to Derive Functions from Observations? 

In most applications, we do not measure a curve, but discrete values of a
curve. An important step in the analysis of functional data is therefore the
transformation of the discretized objects to smooth functions. The general
approach is the following: We represent each example as a linear combination

$$x_i(t) = \sum_{l=1}^{K_x} c_{il} \, \psi_l(t) \qquad (6)$$

of a set of base functions $\psi_1, \ldots, \psi_{K_x}$. The coefficients
$c_{il}$ are then estimated by using (penalized) least squares. The most
frequently used base functions are Fourier bases, B-splines, wavelets and
polynomials. A different possibility is to derive an orthogonal basis
directly from the data. This can be done by using functional principal
component analysis.
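As an illustration of the expansion (6), the sketch below estimates the coefficients of a discretized curve in a Fourier basis by unpenalized least squares. The domain $T = [0, 1]$, the ordering of the basis functions, the function names, and the small linear solver are assumptions made to keep the example self-contained; in practice one would rely on a numerical library and possibly add a penalty.

```python
import math

def fourier_basis(t, n_basis):
    """Evaluate 1, sin(2 pi t), cos(2 pi t), sin(4 pi t), ... at t in [0, 1]."""
    values = [1.0]
    k = 1
    while len(values) < n_basis:
        values.append(math.sin(2 * math.pi * k * t))
        if len(values) < n_basis:
            values.append(math.cos(2 * math.pi * k * t))
        k += 1
    return values

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_coefficients(ts, values, n_basis):
    """Least-squares coefficients of one curve observed at the points ts."""
    design = [fourier_basis(t, n_basis) for t in ts]
    gram = [[sum(row[i] * row[j] for row in design) for j in range(n_basis)]
            for i in range(n_basis)]
    rhs = [sum(row[i] * v for row, v in zip(design, values))
           for i in range(n_basis)]
    return solve(gram, rhs)    # normal equations
```

Applying `fit_coefficients` to every observed curve yields one row of the coefficient matrix per example.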

2.2 Inference from Functional Data 

We only consider linear relationships (1), i.e. in the regression setting
($\mathcal{Y} = \mathbb{R}$), elements $f \in \mathcal{F} = L^2(T)$ are
assumed to be linear (up to an intercept) and continuous. As $\mathcal{F}$
is a Hilbert space, it follows that any function $f \in \mathcal{F}$ is of
the form

$$f(x(t)) = \beta_0 + \int_T \beta(t) \, x(t) \, dt . \qquad (7)$$

In the two-class classification setting ($\mathcal{Y} = \{\pm 1\}$), we use
$\mathrm{sign}(f)$ instead of $f$. As already mentioned in Sect. 1, we
estimate $f$ or $\beta$ by minimizing the empirical risk (2). Note that this
is an ill-posed problem, as there are (in general) infinitely many functions
$\beta$ that fit the data perfectly. There is obviously a need for
regularization, in order to avoid overfitting. We can solve this problem by
using a base expansion of both the predictor variable $x_i(t)$ as in (6) and
the function

$$\beta(t) = \sum_{l=1}^{K_\beta} b_l \, \psi_l(t) . \qquad (8)$$

This transforms (2) into a parametric problem. If we use the quadratic loss,
this is a matrix problem: We set

$$C = (c_{ij}) , \quad J = \left( \int_T \psi_i(t) \, \psi_j(t) \, dt \right) , \quad Z = C J .$$

It follows that (for centered data)

$$\hat{b} = (Z^\top Z)^{-1} Z^\top y . \qquad (9)$$
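To see where the matrix $Z$ comes from, one can plug the expansions (6) and (8) into (7). The derivation below is a sketch under two assumptions: both expansions use the same family of base functions, so that $J$ has $K_x$ rows and $K_\beta$ columns, and the data are centered, so that the intercept drops out.

```latex
\int_T \beta(t)\, x_i(t)\, dt
  = \sum_{l=1}^{K_x} \sum_{j=1}^{K_\beta} c_{il}\, b_j \int_T \psi_l(t)\, \psi_j(t)\, dt
  = \sum_{j=1}^{K_\beta} (CJ)_{ij}\, b_j
  = (Z b)_i ,
\qquad \text{so} \qquad
\sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2 = \lVert y - Z b \rVert^2 .
```

Minimizing the right-hand side over $b$ is an ordinary least squares problem, whose solution is (9) whenever $Z^\top Z$ is invertible.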

As already mentioned, we have to regularize this problem. There are two
possibilities. First, we can constrain the number of base functions in (8),
that is, we demand that $K_\beta \ll K_x$. However, we show in Sect. 3 that
this strategy can lead to trivial results in the Boosting setting. The
second possibility is to add a penalty term $r(f)$ to the empirical risk (2).
If we consider functional data, it is common to use a penalty term of the
form

$$r(\beta) = \lambda \int_T \left( \beta^{(k)}(t) \right)^2 dt .$$

Here $\beta^{(k)}$ is the $k$th derivative of $\beta$, provided that this
derivative exists. The choice of $k$ depends on the data at hand and our
expert knowledge of the problem.
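For the quadratic loss, the penalized problem again has a closed form. The following is a sketch, assuming that the base functions are $k$ times differentiable. The matrix $R$ is notation introduced here for illustration, and the scaling of $\lambda$ depends on how the empirical risk (2) is normalized.

```latex
\begin{aligned}
r(\beta) &= \lambda \int_T \Bigl( \sum_{j=1}^{K_\beta} b_j\, \psi_j^{(k)}(t) \Bigr)^2 dt
          = \lambda\, b^\top R\, b ,
\qquad
R = \Bigl( \int_T \psi_i^{(k)}(t)\, \psi_j^{(k)}(t)\, dt \Bigr) , \\
\hat{b}_\lambda &= \arg\min_b \; \lVert y - Z b \rVert^2 + \lambda\, b^\top R\, b
          = (Z^\top Z + \lambda R)^{-1} Z^\top y .
\end{aligned}
```

For $\lambda = 0$ this reduces to the unpenalized estimate (9); larger values of $\lambda$ shrink the estimate towards functions whose $k$th derivative is small.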

Finally, let us briefly mention how to model a linear relationship (1) if
both the predictor and response variable are functional. We consider
functions

$$f : L^2(T) \to L^2(T) , \qquad f(x(t)) = \alpha(t) + \int \beta(s, t) \, x(s) \, ds .$$



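We estimate $\beta$ by expanding $y_i$, $x_i$, $\alpha$ in terms of a basis
and by representing $\beta$ by

$$\beta(s, t) = \sum_{k=1}^{K_1} \sum_{l=1}^{K_2} b_{kl} \, \psi_k(s) \, \psi_l(t) .$$

The optimal coefficients $b_{kl}$ are determined using the loss function

$$L(y, y') = \int_T \left( y(t) - y'(t) \right)^2 dt .$$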

Again, we have to regularize in order to obtain smooth estimates that do not 
overfit. 

3 Functional Boosting 

In order to apply a Boosting technique to functional data, we have to extend
the notion of a 'weak learner'. In the classification setting, we can adopt
the loose definition from Sect. 1: a weak learner is a learner that is
slightly better than random guessing. What are examples of weak learners?
Note that it is possible to apply most of the multivariate data analysis
tools to functional data. We use a finite-dimensional approximation as in
(6) and simply apply any appropriate algorithm to the coefficients. In this
way, it is possible to use stumps (that is, classification trees with a
single split) or neural networks as base learners.

In the regression setting, we propose the following definition: A weak
learner is a learner that has only a few degrees of freedom. Examples
include the two regularized least squares algorithms presented in Sect. 2:
restriction of the number of base functions in (8), or addition of a penalty
term to (2). Note however that the first method leads to trivial results if
we use L2Boost. The learner is simply the projection of $y$ onto the space
that is spanned by the columns of $Z$ (recall (9)). Consequently, the
residuals of $y$ are orthogonal to the columns of $Z$, and after one step,
the Boosting solution no longer changes. Another example of a weak learner
is the following [Büh06]: In each Boosting step, we only select one base
function using $x_i$ and the residuals $u_i$. To select this base function,
we estimate the regression coefficients $b_j$ of

$$u_i \sim b_j \int_T x_i(t) \, \psi_j(t) \, dt , \quad j = 1, \ldots, K_\beta . \qquad (10)$$

We choose the base function that minimizes the empirical risk (2). For
centered data, this equals

$$m^{*} = \arg\min_{j} \sum_{i=1}^{n} L\left( u_i , \hat{b}_j \int_T \psi_j(t) \, x_i(t) \, dt \right) ,$$
$$\hat{b}_j = \text{least squares estimate of } b_j \text{ in (10)} .$$

Boosting for multivariate data with this kind of weak learner has been
studied in e.g. [Büh06].
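The componentwise weak learner can be sketched as follows for the quadratic loss. The sketch assumes that the integrals of $x_i(t)$ against each base function have already been computed (for example by numerical quadrature) and stored in a table `scores`, and that both the scores and the responses are centered. The function name and the toy data are hypothetical.

```python
def componentwise_l2boost(scores, ys, n_steps):
    """L2Boost in which each step updates a single base function.

    scores[i][j] approximates the integral of x_i(t) * psi_j(t) over T.
    Returns the accumulated coefficients b_1, ..., b_K of beta in (8).
    """
    n_basis = len(scores[0])
    coef = [0.0] * n_basis
    residuals = list(ys)
    for _ in range(n_steps):
        best = None
        for j in range(n_basis):
            col = [row[j] for row in scores]
            norm = sum(z * z for z in col)
            if norm == 0.0:
                continue                      # this base function carries no signal
            # Least squares estimate of b_j in the simple regression (10).
            b_j = sum(z * u for z, u in zip(col, residuals)) / norm
            sse = sum((u - b_j * z) ** 2 for z, u in zip(col, residuals))
            if best is None or sse < best[0]:
                best = (sse, j, b_j)
        if best is None:
            break                             # all score columns are zero
        _, j, b_j = best
        coef[j] += b_j
        residuals = [u - b_j * row[j] for row, u in zip(scores, residuals)]
    return coef
```

With two orthogonal score columns and responses equal to three times the first column plus the second, the first step selects the first base function with coefficient 3 and the second step adds the second base function with coefficient 1. In contrast to the projection learner discussed above, the fit keeps changing over the Boosting steps.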

If the response variable is functional, we can adopt the same definition of weak 
learner as in the regression setting: A weak learner is a learner that uses only a 
few degrees of freedom. 

4 Example: Speech Recognition 

This example is taken from [BBW05]. The data consists of 48 recordings of the 
word 'Yes' and 52 recordings of the word 'No'. One recording is represented 
by a discretized time series of length 8192. The data can be downloaded from 
http://www.math.univ-montp2.fr/~biau/bbwdata.tgz. All calculations are 
performed using R [R D04]. 

The task is to find a classification rule that assigns the correct word to
each time series. We apply the LogitBoost algorithm to this data set. First,
we represent the time series in terms of a Fourier basis expansion of
dimension $K_x = 100$. We opted to include a large number of basis
functions, as experiments indicate that the results of LogitBoost are
insensitive to the addition of possibly irrelevant basis functions. The weak
learner is a classification tree with two final nodes. The misclassification
rate was estimated using 10-fold cross-validation (cv). Figure 1 shows the
cross-validated error as a function of the number of Boosting iterations.
The minimal cv error over all Boosting iterations is 0.1, obtained after 24
Boosting iterations. This is the same error rate that is reported in
[BBW05]. There, a functional k-nearest-neighbor algorithm is applied to the
data. Finally, we remark that the cv error curve stays rather flat after the
minimum is attained. This seems to be a feature of all Boosting methods. As
a consequence, the selection of the optimal number of Boosting iterations
can be done quite easily.

Figure 1: Cross-validated error for the speech recognition problem. The
optimal number of Boosting iterations is $M_{\mathrm{opt}} = 24$.



5 Conclusion 



The extension of Boosting methods to functional data is straightforward.
After choosing a base algorithm (which we called a weak learner), we
iteratively fit the data by either applying this algorithm to reweighted
samples or by using a gradient descent technique. In many applications, we
use a finite-dimensional expansion of the functional examples in terms of
base functions. This finite-dimensional representation can then be plugged
into existing algorithms such as AdaBoost, LogitBoost or L2Boost.

We focused on linear learning problems in Sect. 2 for the sake of simplicity
and brevity, but it should be noted that Boosting methods can also be
applied to solve nonlinear functional data problems.